fix(benchmark): assign unique color per model in scatter plots by ttlequals0 · Pull Request #231 · ttlequals0/MinusPod

ttlequals0 · 2026-05-16T02:50:05Z

Summary

Three benchmark scatter renderers in benchmarks/llm/src/benchmark/report.py (_render_pareto, _render_precision_recall_chart, _render_token_efficiency_chart) were assigning colors with cmap = plt.get_cmap("tab20") then cmap(i % 20). With more than 20 models in a sweep, the legend cycled colors so multiple models shared the same dot color. The current report has 28 models, so the Cost-vs-F1 chart had several visibly duplicate colors.
Added _distinct_colors(n) that concatenates tab20 + tab20b + tab20c (60 categorical colors with good perceptual contrast) and falls back to evenly-spaced hsv past that, so every model gets a unique color up to 60 and remains unique past 60. Magic number is derived from len(palette), not hardcoded.
All three renderers switched to colors = _distinct_colors(len(points)) and colors[i].
Regenerated the three affected SVGs (pareto.svg, precision_recall.svg, token_efficiency.svg). Other charts color by value semantics (threshold bands, heatmaps, p50/p90/p99 categories), not model identity, so they were not affected and were not regenerated to keep this diff focused.
Benchmark-only change. The benchmark tree is dockerignored; no runtime image impact, no version bump, no openapi change.

Test plan

Verified _distinct_colors(n) returns n unique RGBA tuples for n in {5, 30, 60, 80, 200} (60 from categorical, >60 from hsv).
Regenerated report locally with benchmark report; SVGs render and the three target scatters show distinct colors per model.
/simplify and /code-review ran clean against the diff (no findings above the 80-confidence threshold).
CI green on this branch.

Three benchmark scatter renderers (Pareto, precision/recall, token efficiency) drew from tab20 with `i % 20`, so any run with more than 20 models silently reused colors. With 28 models in the current report the legend now had multiple identical dots. Add `_distinct_colors(n)` that concatenates tab20 + tab20b + tab20c (60 categorical colors) and falls back to evenly-spaced hsv past that. All three renderers now call it and index `colors[i]`. The three regenerated SVGs in this commit reflect the fix. Benchmark-only; not shipped in the runtime image; no version bump.

Re-renders report.md plus every SVG in results/report_assets/ from the current calls.jsonl + corpus, so the published artifacts match the rest of the repo's snapshot pattern. The substantive change is still only the three scatter plots (pareto, precision_recall, token_efficiency) where each model now gets a unique color; the other SVGs and report.md only differ in matplotlib-generated element IDs, embedded timestamps, and PASS-row ordering for ties.

Two HIGH-severity findings from the security audit on main (#231 follow-up). 1. Rotate Flask session on /auth/login and /auth/password. Previously `session['authenticated']=True` mutated the existing signed cookie. Any pre-login token (XSS on a sibling subdomain, MITM with insecure transport) would retain the same identifier post-auth. `session.clear()` before re-setting `permanent=True` and `authenticated=True` rotates the cookie so a pre-auth token cannot ride the new authenticated state. SameSite=Strict + Secure already mitigate but do not eliminate the fixation risk. 2. Refuse to start when SECRET_KEY mint cannot be coordinated. Previously get_or_create_secret_key() would warn-and-continue when the flock fallback failed; if the DB write then also failed, two gunicorn workers could mint divergent keys and silently invalidate each other's cookies. Now: if lock_fd is None AND the DB write raises, we raise a new SecretKeyUnavailableError that fails startup. Operators can bypass by setting SECRET_KEY via env. Behavior change: deploys with no SECRET_KEY env AND a broken data dir AND a failing DB will now refuse to boot instead of starting with a divergent key. This is intentional.

…splits) (#232) * fix(security): rotate session on login + fail loud on missing secret key Two HIGH-severity findings from the security audit on main (#231 follow-up). 1. Rotate Flask session on /auth/login and /auth/password. Previously `session['authenticated']=True` mutated the existing signed cookie. Any pre-login token (XSS on a sibling subdomain, MITM with insecure transport) would retain the same identifier post-auth. `session.clear()` before re-setting `permanent=True` and `authenticated=True` rotates the cookie so a pre-auth token cannot ride the new authenticated state. SameSite=Strict + Secure already mitigate but do not eliminate the fixation risk. 2. Refuse to start when SECRET_KEY mint cannot be coordinated. Previously get_or_create_secret_key() would warn-and-continue when the flock fallback failed; if the DB write then also failed, two gunicorn workers could mint divergent keys and silently invalidate each other's cookies. Now: if lock_fd is None AND the DB write raises, we raise a new SecretKeyUnavailableError that fails startup. Operators can bypass by setting SECRET_KEY via env. Behavior change: deploys with no SECRET_KEY env AND a broken data dir AND a failing DB will now refuse to boot instead of starting with a divergent key. This is intentional. * chore: phase-1 audit cleanups (compose, docker, hooks, ASCII, imports, dead alias) Bundle of low-risk hygiene from the 2026-05-16 codebase audit. No behavior change to any user-visible flow. Container / infra - docker-compose.yml + docker-compose.cpu.yml: add security_opt: ["no-new-privileges:true"] and cap_drop: [ALL] to the minuspod service. App already drops to UID 1000 via setpriv; nothing needs caps after start. - Dockerfile + Dockerfile.cpu: pin pip==25.2 and setuptools==80.9.0 for reproducibility (previously --upgrade with no floor). - .githooks/pre-commit: tighten the Bearer regex post-filter to strip obvious placeholder hits (example/placeholder/test/your-/fixture/ sample/dummy/fake/redacted), document the line-oriented multi-line limitation. Also add an ASCII-glyph block that fails commits introducing em-dashes, smart quotes, or U+2605 stars. - .github/workflows/*.yml: pin every third-party action use to its commit SHA with a human-readable comment. Closes the floating-major- tag supply-chain exposure flagged in the audit. Security headers - src/main_app/__init__.py: extend the response hook so application/ json responses get Content-Security-Policy: default-src 'none'; frame-ancestors 'none'. Matches the lockdown already applied to /feed/* RSS. text/html CSP unchanged. - tests/integration/test_security_headers.py: flip test_csp_absent_on_json_responses -> test_csp_locked_down_on_json_ responses to assert the new presence + values. Imports - src/api/episodes.py: hoist audio_peaks, chapters_generator, llm_client, processing_queue out of inline function calls. main_app.processing and ad_detector stay inline (circular). - src/api/feeds.py: hoist urllib.parse.urlparse and defusedxml.ElementTree (aliased DefusedET to avoid shadowing the existing xml.etree.ElementTree alias used for OPML export). - src/audio_fingerprinter.py: replace two inline `import acoustid` calls with one module-level guarded import; methods now branch on `acoustid is None`. - src/processing_timeouts.py: investigated; hoisting still triggers the circular import path through status_service. Kept inline (audit finding was a false positive). Dead code - src/ad_detector.py: drop the INVALID_SPONSOR_REASONS class alias (kept "for old name" per its own comment); two call sites updated to use module-level INVALID_SPONSOR_VALUES directly. ASCII sweep - 66 U+2014 em-dashes and 1 U+2605 star replaced across 26 files (mostly comments, docstrings, a few <option> labels in PatternImportDialog). No smart quotes were found. Future regressions are now blocked by the pre-commit ASCII guard. Verification: 1177 pytest passed, frontend tsc clean. * chore(detection): drop dead network_id parameter from process_transcript The audit flagged a `network_id` parameter that was plumbed three layers deep but never set by any caller of AdDetector.process_transcript. Confirmed: both call sites (src/main_app/processing.py:314 and src/api/episodes.py:837) pass only podcast_id, never network_id. The forward to text_pattern_matcher.find_matches drops the kwarg; that signature keeps network_id because pattern_service.py and other matcher entry points genuinely use it. The second audit finding for this phase -- "podcast_id=slug is mis-scoped" -- turned out to be wrong. ad_patterns.podcast_id is declared TEXT (no FK to podcasts.id; the schema explicitly survives content deletion), and patterns are both written and read using the slug. The field is named misleadingly but the equality check in _filter_patterns_by_scope is consistent. No fix needed. * chore: phase-3 helper consolidation (backend utils + frontend hooks) Consolidate duplicated logic surfaced by the audit. Behavior-preserving. Backend - src/utils/ttl_cache.py: single TTLCache(ttl_seconds) class. Three modules now share it: cleanup_service (settings cache, 300s), llm_client (provider cache, preserving the existing threading.Lock and a _CACHED_NONE sentinel so None results don't re-query), sponsor_service (freshness gate; instance dicts retained because they're read across many methods). - src/utils/time.py: new overlap_ratio(start_a, end_a, start_b, end_b) and ranges_overlap(... , tolerance=0.0). ad_detector._compute_overlap and audio_analysis.base.AudioSegmentSignal.overlaps now one-line wrappers (signatures preserved for test introspection); ad_validator._overlaps_corrections drops its inline math. - src/utils/constants.py: NON_BRAND_WORDS (formerly ad_detector ._NON_BRAND_WORDS) moved to the central constants module with a docstring noting INVALID_SPONSOR_CAPTURE_WORDS is intentionally separate (capture-start filler vs. brand-disqualifier). Alias kept in ad_detector for any in-file references. - src/sponsor_service.py: gains extract_sponsor_from_reason(text) and extract_sponsors_from_transcript(text, ad_reason). The two ad_detector duplicates now delegate. extract_sponsor_names module shim retained because tests import it directly. Frontend - frontend/src/components/Artwork.tsx + ARTWORK_FALLBACK_SVG: replaces the 200-char inline data: SVG that was copied into FeedCard, FeedListItem, EpisodeDetail, FeedDetail. - frontend/src/hooks/useLocalStorageState.ts: generic typed useState+localStorage with a legacy raw-string fallback so existing on-disk values don't get re-parsed. Migrated Dashboard, CollapsibleSection, EpisodeDetail. - frontend/src/hooks/useSyncFromQuery.ts: render-phase seed hook. Settings.tsx three near-identical snapshot-vs-data blocks now collapse to three useSyncFromQuery calls. Kept render-phase rather than useEffect-based per existing project history that intentionally moved this off useEffect. - frontend/src/components/Layout.tsx: NAV_ITEMS table + NavLink subcomponent. 234 -> 156 LOC. lastPath state replaced with useEffect([location.pathname]). - frontend/src/utils/detectionStage.ts: DETECTION_STAGE_META lookup replaces the stringly-typed ternary chain in EpisodeDetail.tsx:500. - frontend/src/api/client.ts: apiFileRequest helper. Four blob downloads (settings, history, feeds-OPML, patterns) drop their hand-rolled Content-Disposition / error-shape parsing. feeds.ts importOpml is a JSON-response upload (not a blob download) so it stays on raw fetch but now uses the shared extractErrorMessage. Verification: 1177 pytest passed, frontend tsc clean. * refactor: phase-4 type tightening (PatternScope, EpisodeStatus, SettingEntry) Three audit findings: stringly-typed fields and dict-shape leaks. Behavior-preserving; the enum value's str subclass means existing literal comparisons keep working. Frontend - PatternScope = 'podcast' | 'network' | 'global' exported from frontend/src/api/patterns.ts. Four bare `string` sites on AdPattern, PatternStats, getPatterns params, and updatePattern updates now reference it. AdEditor.DetectedAd.scope also tightened. PatternCorrection.scope stays narrower ('podcast' | 'global') because user-driven create-mode only exposes those two. Backend - src/utils/constants.py: new EpisodeStatus(str, Enum) covering the seven values that actually appear in the codebase (discovered, pending, processing, processed, failed, permanently_failed, plus the API-only 'completed' alias that api/episodes.py substitutes for 'processed' in responses). Source of truth pulled from the schema CHECK constraint. - Migrated the high-traffic mutation sites: src/api/episodes.py (11 sites), src/main_app/processing.py (6 sites, leaving 2 in record_processing_history whose table uses a different vocab), src/main_app/routes.py (6 sites), src/database/episodes.py (1), src/main_app/background.py (1). Raw SQL string literals left alone -- moving those needs parameter binding, no type-safety benefit at the SQL boundary. - src/api/settings.py: new frozen SettingEntry(value, is_default) dataclass + _settings_view() wrapper. _setting_value and _setting_is_default now accept either SettingEntry or the legacy dict shape so the seven consumers of db.get_all_settings() outside this module aren't affected. Scope-guard applied: wrap at the API-helper layer rather than changing the producer. Verification: 1177 pytest passed, frontend tsc clean. * chore: phase-7+8 frontend mid-tier + remaining security follow-ups Eight items from the audit's lower-priority pile, batched. Security - src/api/auth.py: /auth/logout now manually validates the CSRF token on top of the existing AUTH_EXEMPT_PATHS exemption. Unauthenticated callers still bypass (csrf.validate returns None when session is unauthenticated), preserving the "always callable to clear stale state" property. Closes the cross-site-logout finding. - src/community_sync.py: _fetch_manifest now routes through safe_http.safe_get(trust=URLTrust.OPERATOR_CONFIGURED, stream=True) + read_response_capped(resp, 256*1024). Matches the SSRF posture already used in rss_parser. The manifest URL is a build-time constant today, but the wrapper protects any future setting that exposes it. - src/rss_parser.py: DOCTYPE pre-scan widened from 4 KB to 64 KB so a feed with extensive leading whitespace/comments still triggers the xml_forbidden_construct warning. defusedxml still gates XXE either way. - entrypoint.sh: added -xdev to both find calls and a pre-check that emits a WARN line (continues, doesn't abort) when DATA_DIR is owned by a uid that's neither 0 nor APP_UID. Warning-only first release so we see how it behaves on existing volumes before tightening to exit 1. Code consolidation - src/llm_client.py: extracted shared retry/circuit-breaker skeleton into LLMClient._send_with_fallback. The two concrete messages_create methods (Anthropic, OpenAI-compatible) now define a local _send closure and delegate the bookkeeping. ~60 LOC of duplicate try/except gone. Request kwargs still per-provider since Anthropic and OpenAI shapes diverge. - src/ad_detector.py: replaced the five lazy @Property accessors (db, audio_fingerprinter, text_pattern_matcher, pattern_service, sponsor_service) with a single _ensure_deps() called from initialize_client and the four DB-reading getters. Attributes are now plain. Removes the thread-race window on first access. Test fixtures in tests/unit/test_ad_detector* updated to match the plain-attribute surface. Frontend - frontend/src/pages/StatsPage.tsx: SortTh replaced with a makeSortTh factory closure-bound to sortField/sortDir/handleSort via useMemo. Eight call sites lose three props each. - frontend/src/pages/Settings.tsx: dropped the useEffect that was force-writing 'settings-section-system-status' to true on every load (overriding the user's collapse choice). CollapsibleSection.defaultOpen already covers the same need cleanly. Verification: 1177 pytest passed, frontend tsc clean, bash -n entrypoint.sh ok. * refactor(api): decompose update_ad_detection_settings + pattern handlers PR 5.1 + 5.2 from the audit follow-up plan. src/api/settings.py: update_ad_detection_settings dropped from a 263-line handler to 34 lines of dispatch. The audit's reference to nested _tu/_sv closures was stale -- those were already lifted to _apply_stage_tunables in a prior pass -- but the remaining sequential `if 'key' in data:` chain has now been split into nine named phase helpers that each take (db, data) and return None or an error response: _apply_prompt_fields, _apply_review_fields, _apply_model_fields, _apply_processing_flags, _apply_min_cut_confidence, _apply_provider_fields, _apply_whisper_fields, _apply_vad_gap_fields, _apply_podcast_index_fields. Handler iterates a tuple of callables; ordering and every set_setting / error string / log message is preserved verbatim. src/api/patterns.py: three large route bodies decomposed. - submit_correction: 247 -> 44 LOC. Per-action work moved to _handle_confirm_correction, _handle_reject_correction, _handle_adjust_correction, _maybe_rewrite_pattern_from_adjustment, with shared _resolve_or_create_pattern_from_text covering the dedup-or-create-from-text path (label param toggles log wording). - _submit_correction_create: 120 -> 77 LOC. Pulled out _validate_create_correction_input and _insert_manual_marker. - import_patterns: 136 -> 40 LOC. Pulled out _validate_import_request, _validate_import_items, _upsert_import_pattern, _apply_pattern_imports. Behavior preservation: - 1177 pytest passed (baseline) -> 1177 pytest passed (after). - Pattern-specific suites (84 tests) and settings-specific suites (15 tests) all pass. - Route signatures, JSON contracts, error message strings, log message strings, validation rules, and ordering all unchanged. - inline imports of PatternService inside _handle_adjust_correction preserved (original called it twice with the same shape; keeps mock-based test timing identical). * refactor(detection): decompose learn_from_detections into 4 phase helpers PR 5.3 from the audit follow-up. HIGH-risk per the audit notes because pattern memory persists across episodes -- a regression would only surface on the NEXT episode encountering a similar ad. Held to behavior-byte-identical: same signature, same control flow, every log call preserved verbatim, same active_pattern_sponsors preload position. src/ad_detector.py: learn_from_detections shrank 151 -> 52 LOC. Logic moved into four private methods on AdDetector: - _ad_passes_learning_filters(ad, min_confidence): was_cut + stage + confidence floor + long-ad stricter threshold (34 LOC). - _resolve_sponsor_for_learning(ad): 4-tier sponsor resolution + canonicalization (37 LOC). - _sponsor_blocked_by_gates(sponsor, active_pattern_sponsors): Gate A (prefix-of-known) + Gate B (unknown short single word) (34 LOC). - _create_pattern_and_fingerprint(...): create_pattern_from_ad + optional fingerprint store + per-ad try/except (41 LOC). The orchestrator now reads as the audit-described pipeline: filter -> resolve sponsor -> gates -> create. Verification: 1177 pytest passed (baseline) -> 1177 passed (after); test_ad_detector_learn_from_detections.py 7/7; test_ad_detector.py 30/30. No log line changed, no exception swallow widened. * refactor(processing): decompose four pipeline-stage helpers PR 5.4 from the audit follow-up. HIGH-risk per the audit because these are the actual production ad-removal pipeline; a regression would surface as wrong cuts in real episodes, not as a unit-test failure. Held to behavior-byte-identical: public signatures unchanged, every audio_logger call preserved verbatim, lazy-import pattern preserved, component-resolution order unchanged (helpers take db / storage / ad_detector explicitly rather than re-calling _get_components per helper). src/main_app/processing.py: - _run_ad_reviewer 95 -> 67 LOC. Extracted _apply_reviewer_verdict_to_ad, _merge_reviewer_result. - _refine_and_validate 124 -> 67 LOC. Extracted _refine_boundaries, _apply_heuristic_rolls, _load_user_corrections, _gate_validation_by_confidence. - _finalize_episode 109 -> 26 LOC. Extracted _persist_episode_state, _refresh_rss_for_slug, _log_completion_summary, _record_history_and_event. Big drop because the function was almost entirely sequential side effects with no shared local state between phases. - _run_verification_pass 157 -> 93 LOC. Extracted _apply_pass2_heuristic_rolls, _validate_verification_ads, _gate_verification_ads_by_confidence, _recut_processed_audio. Did not push under 90 because the outer try/except wrap + nested control flow doesn't benefit from further splitting. Verification: pytest at 1177 passed after each individual function decomposition (four checkpoints) and 1177 passed after the final state. No log line text changed; lazy imports preserved. * refactor(processing): EpisodeContext dataclass for pipeline plumbing PR 5.5 from the audit follow-up. Holds the immutable per-episode context shared across the detection pipeline so each stage helper can take one object instead of a long parameter list. Mutable plumbing (progress_callback, cancel_event, audio_analysis, audio_path, skip_patterns, segments) stays explicit because it isn't "context". New file src/main_app/episode_context.py: - @DataClass(frozen=True) EpisodeContext with: slug, episode_id, podcast_name, episode_title, podcast_id, podcast_description, episode_description, podcast_tags. src/ad_detector.py: - process_transcript gains a keyword-only `ctx=None` parameter that unpacks into the existing locals when supplied. Old positional signature preserved so api/episodes.py:838 doesn't need to change. - When ctx is provided, podcast_id for pattern-scoping is set from ctx.slug (matches the long-standing call convention of passing slug-as-podcast_id; ad_patterns.podcast_id is TEXT, not the integer FK, see prior commit's note). src/main_app/processing.py: - process_episode builds ctx once after stage-2 audio analysis and threads it through the migrated helpers. Podcast-row + tags lookup consolidated to a single site (was inside _detect_ads_first_pass). - _detect_ads_first_pass: 11 -> 6 params. - _run_verification_pass: 13 -> 7 params. - _apply_pass2_reviewer: 12 -> 6 params. Out of scope by design: _run_ad_reviewer / _build_episode_meta (uses integer DB PK podcast_id and depends on a separate podcast_row fetch right before the call -- moving it would shift the DB read upstream, which the "Don't dismiss race warnings" feedback memo cautions against). _process_episode_background (9 params; all are external IO forwarded to process_episode, not pipeline-internal). Verification: 1177 pytest passed (baseline) -> 1177 passed (after). Diff: 3 files, +96 / -33 LOC, well under the 400-LOC scope cap. * refactor: phase-6 structural splits (ad_detector, schema, AdReviewModal) Three big-file decompositions from the audit. All HIGH-risk per the plan. Strategy was pure code-motion only -- git mv for the central renames so blame survives, no behavior change, no signature or log-line edits, no SQL DDL byte changes. PR 6.1: src/ad_detector.py -> src/ad_detector/ package. - __init__.py 1403 LOC: AdDetector class + explicit named re-exports covering all 28 public names tests and production import (audited via `grep "from ad_detector import"` repo-wide, including the underscored ones tests reach into directly: _NON_BRAND_WORDS, _extract_ad_keywords, _find_keyword_region, _unpack_region, _text_has_ad_content, _find_json_array_candidates). - boundaries.py 857 LOC: refine/extend/snap/merge/validate/dedupe helpers + transition phrases. - prompts.py 393 LOC: USER_PROMPT_TEMPLATE, create_windows, format_window_prompt, get_static_system_prompt, parse_ads_from_response. - Central file dropped 2575 -> 1403 LOC (-45%). Net +78 LOC from added module docstrings and duplicated imports. - Git records: R src/ad_detector.py -> src/ad_detector/__init__.py. PR 6.2: src/database/schema.py -> src/database/schema/ package. EXTREMELY HIGH risk per CLAUDE.md "never lose data on migrations". Pure code motion only. SHA-256 of SCHEMA_SQL and MIGRATION_INDEXES_SQL identical before/after the split (13aac307..., a73d3d14...). No MIGRATIONS registry exists in this codebase to reorder; the migration sequence is encoded in a single 770-line _run_schema_migrations method that was not touched. Only the two SQL DDL string constants moved out to tables.py (306 LOC); re-imported from __init__.py for back-compat. SchemaMixin class stays in __init__.py. - Migration-focused tests (test_migration_sponsor_fk, test_only_expose_processed_migration, test_settings_migration, test_database): 92 passed. PR 6.3: frontend/src/components/AdReviewModal.tsx decomposition. 3 of 6 audit-planned extractions: - frontend/src/utils/adReviewHelpers.ts: state-free helpers (parseTimeInput, formatTime, getThemeWaveformColors, loadPlayWhileDragging, savePlayWhileDragging, PLAY_WHILE_DRAG_KEY). - frontend/src/components/ad-editor/Pin.tsx: Pin subcomponent + PinProps interface. Self-contained, owns its own dragging state. MIN_AD_DURATION=1.0 duplicated locally to avoid a circular import; comment in Pin.tsx flags the two-site invariant. - frontend/src/components/ad-editor/usePeaks.ts: peaks fetch/cancel effect + the three peaks state vars. setPeaks(null) on resetView dropped because the hook already clears+re-fetches on resetTick. - AdReviewModal.tsx: 1654 -> 1448 LOC (-12.5%). Default export and named exports unchanged. Scope-limited per the audit's "if extraction causes subtle reactivity changes, revert" guard: - useWavesurfer NOT extracted -- wavesurfer mount deliberately omits adStart/adEnd from deps (eslint-disable); wrapping would either reintroduce deps (changes reactivity) or require ref-passing that complicates the surface. - AdReviewToolbar + Cursor + usePinDrag NOT extracted -- prop counts per component would balloon to 8-15 for marginal LOC savings. Verification: 1177 pytest passed, frontend tsc clean. * refactor(main_app): drop _get_components tuples in favor of direct imports PR 6.4 from the audit. Four files were doing positional unpacking of a 4-10 element tuple returned from a local _get_components() helper: - src/main_app/processing.py:35 10-tuple, 13 call sites (mostly _,_,_,...) - src/main_app/routes.py:67 4-tuple - src/main_app/background.py:13 5-tuple - src/main_app/feeds.py:31 5-tuple Tuple-reorder was a silent failure mode -- every caller used positional unpacks, so a reorder in one place would mis-bind globals everywhere. All four files now do `from main_app import db, storage, ...` at the top and reference singletons directly; the _get_components helpers are deleted. No circular import: singletons are instantiated in main_app/__init__.py:146-159 before the explicit submodule imports at lines 616-622, so any `import main_app.<sub>` triggers a full __init__.py run-through first and the names are bound. Test fallout fixed in the same commit: - tests/unit/test_feed_304_refresh.py: 2 tests were patching main_app.feeds._get_components; repointed to patch the individual singletons (db, rss_parser, storage, status_service, pattern_service) on main_app.feeds. - tests/unit/test_head_request.py: 8 tests were patching on the main_app package namespace; once routes.py binds those names locally via `from main_app import ...`, patches at the package level don't affect them. Repointed to main_app.routes.{...}. Verification: pytest at 1177 passed after each of the four file conversions (four checkpoints) and at the final state. Net -32 LOC. * fix(ci): unbreak eslint static-components + codeql py/reflective-xss Two CI failures from PR 232: 1. eslint react-hooks/static-components (9 errors in StatsPage.tsx). The Phase 7+8 makeSortTh factory returned a component during render, which the rule rejects because per-render component identity churn forces React to remount every <SortTh /> on each state change. Reverted SortTh to a stable module-level function component that takes sortField/sortDir/onSort as explicit props. The eight call sites pass them through; the win the audit asked for (no per-instance click handler rebuilding via closure factory) is now achieved via React's normal prop equality. 2. CodeQL py/reflective-xss (HIGH) flagged src/api/patterns.py:1065 inside _validate_import_request after the Phase 5.2 decomposition. The path was a false positive (json_response uses jsonify which HTML-escapes), but CodeQL's interprocedural analysis lost line-of-sight when patterns/mode moved behind a helper that received the raw `data` dict. Hoisted the data.get() extraction back to the caller; the validator now takes already-extracted (patterns, mode) args, so the request -> validate -> error_response flow is visible inline at import_patterns. _IMPORT_MODES tuple constant replaces the inline literal list (one source of truth for the accepted values). Verification: 1177 pytest passed, frontend tsc + eslint clean. * fix(ci): restore Layout render-phase pathname compare + inline mode validation Second round of CI fixes on PR 232. 1. Layout.tsx -- react-compiler eslint rule rejects setState inside useEffect ('Calling setState synchronously within an effect can trigger cascading renders'). The Phase 3 change replaced the original render-phase lastPath compare with the standard useEffect([location.pathname]) pattern. The React docs explicitly recommend the render-phase idiom for 'reset state on prop change' (https://react.dev/learn/you-might-not-need-an-effect#adjusting- some-state-when-a-prop-changes), and the new lint rule enforces that recommendation. Restored the original pattern; dropped the now-unused useEffect import. 2. src/api/patterns.py -- CodeQL py/reflective-xss still tripped on the helper-returned 'mode' even after hoisting data.get() to the caller. CodeQL's interprocedural analysis won't trust a narrowing that happens inside a called function. Replaced the helper-side mode validation with an inline if/elif/else chain at the request boundary that binds the value to one of three literal strings. That gives CodeQL a direct line-of-sight from request.get_json() to a bound check it can prove. _validate_import_request still takes mode as an arg but no longer makes its own membership check (the caller already did). Verification: 1177 pytest passed, frontend tsc + eslint clean. * fix(ci): inline the import-patterns response paths so CodeQL sees the literal mode Third CI fix. CodeQL py/reflective-xss kept tripping on the empty-patterns no-op response inside _validate_import_request, which echoed `{'mode': mode, ...}`. mode is constrained to one of three literal strings by the inline if/elif/else in the caller, but CodeQL's interprocedural analysis won't trust that narrowing across a helper boundary -- it sees a tainted value flowing into a response body. Moved the no-op response and the empty-replace 400 directly into import_patterns. The helper now only computes a boolean predicate (_is_empty_replace_request). All response construction happens at the caller, where mode is provably one of 'merge' / 'replace' / 'supplement' from the inline narrowing earlier in the function. The data flow is now linear and visible: data.get('mode') -> bound check -> response body. Verification: 1177 pytest passed. * chore: bump to 2.4.9 for the audit-cleanup release Release notes consolidated under [2.4.9] in CHANGELOG.md. The release ships the security HIGH fixes (session rotation, fail-loud secret key, CSRF on logout, JSON CSP, community_sync safe_http, entrypoint hardening) plus 50+ structural refactors held to behavior-byte-identical. Full pytest 1177 + frontend tsc clean at every commit boundary on this branch. openapi.yaml info.version bumped in sync with version.py per the project rule. * test(smoke): phase B - cover the three 2.4.9 security additions Three new local smoke tests that exercise the security HIGHs landed earlier in this PR. Verified against the 2.4.9 image (ttlequals0/minuspod:2.4.9) running as the smoke container. T19-session-rotation (3/3 passing locally) - Seed a session cookie on an unauth GET /auth/status. - Login carrying that cookie. - Assert the post-login Set-Cookie session= value differs from the pre-login one. Flask's signed cookies are deterministic for the same session payload, so a non-rotated session would produce identical strings; difference proves session.clear() ran before the re-init. Also verifies the new cookie successfully authenticates a follow-up /auth/status. T20-logout-csrf (5/5 passing locally) - POST /auth/logout without X-CSRF-Token -> 403 (rejected). - Authenticated state preserved across the rejected attempt. - POST /auth/logout with X-CSRF-Token -> 200. - Authenticated state cleared after the successful logout. - Unauthenticated POST /auth/logout still returns 200 (the "always callable to clear stale state" property is preserved; csrf.validate returns None when session.authenticated is False). T21-json-csp (4/4 passing locally) - GET JSON endpoint includes Content-Security-Policy: default-src 'none'; frame-ancestors 'none'. - HTML /ui/ CSP still present (catches regressions in the text/html branch caused by the JSON addition). - JSON error response (POST /auth/login with empty body) also ships the locked-down CSP. Wired into run-all.sh's local sequence between 18-multi-worker and 17-shutdown (17 runs last because it kills the container). Not covered here (deferred by design): - SECRET_KEY fail-loud at startup: hard to exercise from outside the container without sabotaging the data dir; covered better by a unit test against get_or_create_secret_key directly. * test(smoke): phase C - pattern system end-to-end + per-test IP isolation Five new local smoke tests covering the pattern endpoints that were previously untouched (~21 of 30 uncovered routes). Verified against the 2.4.9 image (ttlequals0/minuspod:2.4.9). All 8 Phase B+C tests pass 44/44 assertions in a clean sequential run. T22-pattern-import-modes (4/4) - merge / supplement modes round-trip. - replace with empty array -> 400 (fat-finger guard). - invalid mode -> 400. T23-pattern-readonly (10/10) - /patterns/stats, /patterns/health, /patterns/contaminated return well-shaped JSON. - /patterns?scope=, /patterns?source=, /patterns?active=false filter without erroring. - /patterns/deduplicate and /patterns/backfill-false-positives return 200 (idempotent on a clean DB). T24-pattern-crud (9/9) - Per-pattern CRUD: import -> GET <id> -> PUT update -> verify persisted. GET on missing id -> 404. - bulk-disable + bulk-delete with their expected_count guard and the without-confirm 400 path. T25-pattern-community (3/3) - /patterns/preview-export returns ready/rejected shape. - /patterns/submit-bundle returns a JSON bundle. - GET /community-patterns/sync-status returns 200. - (/community-patterns/all is DELETE-only and destructive; not exercised here.) T26-pattern-edge-cases (6/6) - /patterns/<id>/split on a non-contaminated pattern -> 400. - /patterns/merge with empty source_ids -> 400. - /patterns/<id>/protect on a local-source pattern -> 400 (community-only). - protect on missing id -> 404. - unprotect on a local pattern -> 200 (intentionally idempotent; no source guard on the DELETE path). - submit-to-community on missing id -> 400 (the export pipeline raises ExportError before any 404 lookup). Per-test IP isolation (smoke/lib/common.sh) - New smoke_ip helper returns a per-test, per-run unique private IPv4 (10.<test>.<derived>.<derived>) for spoofing in X-Forwarded-For. /patterns/import has a 3/hour per-IP rate limit and /auth/login a 3/min per-IP limit; reusing the same IP across runs starves the quota and tests fail for the wrong reason. - login() helper extended with an optional 4th X-FF arg; all of T19-T26 now log in via a per-test unique IP. Smoke covers (after this commit): - T19 session rotation, T20 CSRF on /auth/logout, T21 JSON CSP (Phase B, 2.4.9 security additions). - T22-T26 pattern system surface (Phase C). - Pre-existing T14 still tests the export -> reimport happy path. Run-all sequence updated to include 22-26 between 21-json-csp and 17-shutdown so 17 stays last (it kills the container). * refactor(smoke): apply /simplify findings to phase B+C tests Three review agents (reuse / quality / efficiency) ran against the Phase B+C smoke tests and surfaced ~200 LOC of duplication plus dead code in T22. All 8 tests still pass 44/44 after the refactor. New helpers in smoke/lib/common.sh - setup_authed_jar <test_num>: create jar, login from a per-run unique X-FF IP, grab CSRF, register an EXIT trap so the jar is cleaned up even when an assertion aborts the script. Exports JAR and CSRF globals. - auth_json <method> <path> <test_num> <body>: authenticated JSON-body request returning the response body on stdout. Bundles Content-Type + X-Forwarded-For + X-CSRF-Token in one call. - auth_code <method> <path> <test_num> [body]: same shape but returns the HTTP code instead of the body. Used when the assertion is "expect this status". - json_get <key>: read stdin, print the top-level JSON field null-safely. Replaces ~12 inline `python3 -c 'import json,sys; print(json.load(sys.stdin).get(...))'` blocks. - find_pattern_id_by_sponsor <sponsor>: look up the test pattern's id after import (the import endpoint returns counts only, not ids). Replaces 4 identical 8-line filter-by-sponsor blocks. - import_test_pattern <test_num> <sponsor> <text>: build the payload, import with mode=merge, return the assigned id. Pairs with bulk_delete_pattern for setup/teardown. - bulk_delete_pattern <test_num> <id>: cleanup helper satisfying the bulk-delete expected_count fat-finger guard. - header_value <name>: case-insensitive header extractor from a curl -i dump. Replaces 3x awk pipelines in T21. Dead code removed - smoke/local/22-pattern-import-modes.sh: deleted post_import and post_import_body helpers. Both were defined-but-unused; the caller at the empty-replace path invoked post_import_body with curl-style flags as positional args, then immediately overwrote the result with a fresh inline curl ("the wrapper above swallows curl args; redo cleanly"). The first invocation also burned a rate-limit slot for nothing. Replaced with auth_code helper. Assert tightening - T26 split-on-non-contaminated: was assert_in "400 404"; the pattern was just imported so 404 would mask a regression. Tightened to assert_eq 400. - T26 submit-to-community on missing id: same; the handler returns 400 with ExportError reasons, never 404. Tightened to assert_eq 400 with a comment explaining why. Test files dropped 696 -> 424 LOC (-272). common.sh +137 LOC for the helpers. Net -135 LOC, plus the duplication that made each new test feel like the previous one is gone. Verification: 8/8 tests still PASS, 44/44 assertions, in a clean sequential run against ttlequals0/minuspod:2.4.9. * test(smoke): phase D - settings system coverage (T27-T30) Four new local smoke tests covering the settings system end-to-end: 12/12 tests now pass 74/74 assertions in a clean sequential run against ttlequals0/minuspod:2.4.9. T27-settings-readonly (10/10) - GET /settings, /settings/models, /settings/whisper-models, /settings/retention, /settings/audio, /settings/processing-timeouts, /settings/reviewer, /settings/community-sync, /settings/webhooks, /networks return well-shaped JSON. T28-settings-roundtrip (8/8) - PUT /settings/retention (with negative-value 400 rejection). - PUT /settings/audio (both keepOriginalAudio values; missing-key 400). - PUT /settings/processing-timeouts (both values; missing-key 400). - Uses the new poll_for helper to handle 2-worker memory:// cache staleness (a PUT invalidates one worker's cache; a follow-up GET can land on the other worker). T29-settings-resets (4/4) - POST /settings/ad-detection/reset (idempotent on second call). - POST /settings/prompts/reset. - GET /settings still returns 200 after both resets. T30-webhook-crud (8/8) - POST /settings/webhooks/validate-template (happy + malformed). - POST /settings/webhooks creates with valid event name from VALID_EVENTS ("Episode Processed"). - PUT /settings/webhooks/<id> updates events. - GET /settings/webhooks lists the new hook. - POST /settings/webhooks/<id>/test returns {success: bool} (delivery to example.invalid will fail safely; we just check call shape). - DELETE /settings/webhooks/<id> + confirm gone. New helper in smoke/lib/common.sh - poll_for <endpoint> <key> <expected> [tries]: GET an endpoint up to N times, returning early when json_get(key) == expected. Promoted from a local T28 helper because future settings/community- sync round-trips and any other cache-backed endpoint will hit the same multi-worker staleness window. Run-all sequence updated to include 27-30 between 26 and 17-shutdown. /simplify review applied: - poll_for promoted to common.sh. - T29's raw curl for /settings shape check replaced with auth_code. - T30's /test endpoint call now goes through auth_json (was missing X-Forwarded-For and skipping per-IP rate-limit rotation). - T30's pass/fail control flow restructured from && / || chain to explicit if/else for clarity. - T28 has a comment explaining the True/False string coupling to json_get's Python `print(bool)` output. Found one real API contract discovery during test development: - Webhook events use title-case spaced names ("Episode Processed", "Auth Failure", "Episode Failed") per src/webhook_service.py VALID_EVENTS, not snake_case. Tests use the correct values. * test(smoke): phases E, F, G - episode/feed/search/history/docs/SSE Six new local smoke tests bringing Phase B+C+D+E+F+G to 18 tests covering ~70 API routes. All 18 pass clean against ttlequals0/minuspod:2.4.9 running locally. Phase E - episode + feed surface T31-episode-error-paths (15/15) - 7 per-episode GETs on missing slug/id -> 404/400. - 4 per-episode mutations (reprocess, regenerate-chapters, cancel, retry-ad-detection) on missing -> 404/400. - alias POST /episodes/<slug>/<id>/reprocess -> 404/400. - bulk endpoint with empty payload -> 400/404. - 2 feed-level mutations on missing slug -> 404/400. T32-feed-listing (4/4) - GET /feeds returns list shape (empty on fresh DB). - GET /episodes/processing returns 200. - POST /feeds/refresh no-op on empty fleet. - GET /feeds/export-opml returns 200. Phase F - search / history / docs / liveness T33-search (4/4) - GET /search?q=... returns 200. - GET /search/stats returns dict shape. - POST /search/rebuild idempotent. - /podcast-search accepts 200 / 4xx / 5xx (smoke env has no Podcast Index key, so accept either configured or unconfigured behavior). T34-history (3/3) - GET /history list shape. - GET /history/stats returns 200. - GET /history/export returns 200 (downloadable blob). T35-docs-and-liveness (6/6) - /api/v1/docs and /docs/ Swagger UI 200/redirect. - /openapi.yaml first line is `openapi:`, info.version is SemVer. - /health/live cheap liveness probe. - /api/v1/health deep readiness probe reports status=healthy. Phase G - SSE T36-sse-stream (3/3) - Unauthenticated /status/stream emits `event: auth-failed` and the server closes the stream (verified via curl --max-time 3s). - Authenticated /status/stream emits the initial `data:` snapshot. - Authenticated stream does NOT emit auth-failed. Notable fix during T35 development: piping a 170+ KB openapi.yaml through `head -1 | grep` races head's pipe-close vs printf's write and triggers SIGPIPE 141, which set -o pipefail turns into a false- negative assertion failure. Replaced the pipe with `curl -o file` plus pure-bash parameter expansion and `awk` field extraction. Run-all sequence updated to include 31-36 between 30 and 17-shutdown. * fix(smoke): auto X-FF on login + checkpoint WAL before docker cp + lockout pacing + XXE log regex Final fixes after running the full smoke suite end-to-end. All 36 tests (T01-T36) plus 00-setup and 99-teardown now PASS, 0 FAIL, 2 SKIP (T08 artwork + T12 RSS public paths, both by design on a fresh DB without fixture data). smoke/lib/common.sh - login() auto-X-FF - Default xff (when caller passes none) is now `smoke_ip <test_num>` derived from TEST_NAME. Without this, the cumulative /auth/login calls from 30+ tests all hit the loopback IP and the 10/hour per-IP cap kicks in around test 11, making every downstream test fail with 401 instead of the assertion it's trying to verify. smoke/local/13-backup.sh - WAL checkpoint - The provider_crypto_salt that the backup endpoint persists may still be buffered in podcast.db-wal at docker-cp time. Now run `python3 -c sqlite3.PRAGMA wal_checkpoint(FULL)` via docker exec before copying so the salt is in the main DB file (the runtime image has python3 but not the sqlite3 CLI). smoke/local/11-lockout.sh - 65s gap before 6th attempt - The lockout test paces 5 wrong-password attempts 22s apart, then fires the 6th to verify the 429+Retry-After response. The cumulative 5 attempts fit flask-limiter's 3/min sliding window, but the 6th attempt was getting 429 from the limiter (which doesn't include Retry-After) before the route handler's lockout check could run. Added a 65s sleep before the 6th attempt so the limiter window clears and the lockout path actually fires. smoke/local/07-xxe.sh - accept SSRF-blocked log line - The smoke XXE feed is hosted at <gateway>:18074 (a non-standard port). The SSRF validator blocks non-standard ports before the XML parser ever sees the body, so the defusedxml log line never fires. The defense is still working (verified by /etc/passwd absence), so the regex now accepts "SSRF blocked" alongside "xml forbidden construct" as evidence of rejection. End-to-end suite run: 36 PASS, 0 FAIL, 2 SKIP. 100% pass rate against ttlequals0/minuspod:audit-cleanup running locally. * fix(settings/ui): disable Test on unsaved drafts + explain why input clears after Save Two UX fixes to ProviderKeyField (shared by LLM Provider section AND the Transcription whisper-API key) that close a confusion path real users have hit: 1. Test was clickable on a typed-but-unsaved draft. The /providers/<provider>/test endpoint reads the SAVED key from the DB, not the React draft state. Clicking Test before Save returned "no key configured" even though the input clearly held a value, leading users to file "field not set" bug reports. Now: when `draft.length > 0 && status.source !== 'db'`, the Test button is disabled and carries a title tooltip: "Click Save first -- Test reads the saved key, not the unsaved draft." 2. After Save the input cleared with no explanation. The field is wiped post-save because the key is stored encrypted and we don't re-display secrets. The placeholder ("(stored - enter new value to change)") explained this but was easy to miss, so the blank input looked like "save erased my key" -- the other common user report. Now: a green inline "Saved -- input cleared because keys are stored encrypted" notice appears next to the action buttons for 4s on successful save. Verified both with Playwright against the local 2.4.9 image: - openrouter (status.source='none') with typed draft -> Test disabled with tooltip text matching the fix. - After Save -> input clears, "Stored encrypted" chip appears, green notice "Saved -- input cleared because keys are stored encrypted" renders. Subsequent Test runs reach the backend (returns connection failed against the fake endpoint, as expected). The fix is a one-file component change; LLM provider keys (anthropic, openrouter, openai, ollama) and the whisper API key all use the same ProviderKeyField component and get the new behavior uniformly. * fix(settings): close GH-234 -- Base URL Test failure, vanishing Save Changes, hydration/compare mismatch Three real bugs from issue #234 ("Unable to set LLM Provider base URL in the UI"). Verified end-to-end with Playwright after each fix. 1. Test reports "base URL not set" even after the user saves ----------------------------------------------------------- Root cause: the "Save" button next to the API key only hit PUT /providers/<name> with apiKey; the Base URL field's value was parked in React state, saved only by the global Save Changes button at the top of the page. A user who clicked Save next to the key (the obvious target) had the key persisted but no base URL, so the next Test failed with "base URL not configured". Fix: handleProviderKeySave now includes the current openaiBaseUrl (or whisperApiConfig.baseUrl for the whisper provider) in the same PUT body. One Save covers both. Settings.tsx:94-105. 2. Save Changes button vanishes after a successful save ----------------------------------------------------------- Root cause: get_effective_base_url / get_effective_provider read through a 5-second TTL cache (src/llm_client.py:_provider_cache). Settings PUT writes the DB but did NOT invalidate this cache. The post-mutation GET /settings refetch fired within the 5s window, returned the pre-write value, the UI re-hydrated state to the stale value, hasChanges flipped back to false, and the Save Changes button disappeared. Fix: new invalidate_provider_cache() helper in llm_client.py; called from _apply_provider_fields and both /providers/<name> PUT/DELETE handlers. 3. Save Changes always visible on a fresh load ----------------------------------------------------------- Root cause (discovered while verifying #2): the hydration block used `|| ''` for prompt/model fields while the compare block used `|| settings.defaults.X`. When the server stored an empty string and the default was non-empty (e.g. claudeModel saved as "" but default "claude-sonnet-4-5..."), hasChanges was true on load with no edits. Fix: aligned every compare line in computeChangedFields with the exact fallback used in the hydration block. Verification (Playwright against locally-built 2.4.9): - Save next to API key sends {apiKey, baseUrl}; Test now reaches the backend and returns "connection failed" against fake endpoint. - Post-save refetch sees fresh value; hasChanges flips false; Save Changes hides correctly. - Fresh load: Save Changes HIDDEN (no spurious dirty state). - Editing only Base URL -> Save Changes APPEARS. Frontend bundle rebuild on the next image build; static/ui is gitignored.

ttlequals0 added 2 commits May 15, 2026 22:49

ttlequals0 merged commit 9ce089c into main May 16, 2026
13 checks passed

ttlequals0 deleted the fix/benchmark-distinct-colors branch May 16, 2026 02:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(benchmark): assign unique color per model in scatter plots#231

fix(benchmark): assign unique color per model in scatter plots#231
ttlequals0 merged 2 commits into
mainfrom
fix/benchmark-distinct-colors

ttlequals0 commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ttlequals0 commented May 16, 2026

Summary

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant